BAYESIAN ANALYSIS OF VARIABLE-ORDER,
REVERSIBLE MARKOV CHAINS

By Sergio Bacallado 

Stanford University 

We define a conjugate prior for the reversible Markov chain of
order r. The prior arises from a partially exchangeable reinforced 
random walk, in the same way that the Beta distribution arises from 
the exchangeable Polya urn. An extension to variable-order Markov 
chains is also derived. We show the utility of this prior in testing the 
order and estimating the parameters of a reversible Markov model. 

1. Introduction. Reversible Markov chains are central to a number of 
fields. They underlie problems in applied probability like card-shuffling and 
queueing networks [1, 13] and pervade computational statistics through the 
many variants of Markov chain Monte Carlo; in physics, they are natural 
stochastic models for time-reversible dynamics. However, the notion of
reversibility in stochastic processes with memory is not as widely discussed,
and statistical problems like testing the order of a reversible process remain 
a challenge. 

We define a conjugate prior for higher-order, reversible Markov chains, 
which extends a prior for reversible Markov chains by Diaconis and Rolles [10].
We begin by defining reversibility in a more general setting and motivating 
the significance of higher-order processes. In Section 2, we present two graph- 
ical representations for an order-r, reversible Markov chain, which are used 
in Section 3 to derive the conjugate prior via a random walk with rein- 
forcement. We dedicate Section 4 to variable-order Markov chains, a family 
of models that avoids the curse of dimensionality associated with higher- 
order Markov chains, proving essential in certain applications. Finally in 
Section 5, we discuss properties of the prior pertaining to Bayesian analy- 
sis. In examples, we test the extent of memory of a lumped Markov chain 
and discretized molecular dynamics trajectories, and compare the posterior
inferences of different models. 



Definition 1.1. A stochastic process $X = X_n, n \in \mathbb{N}$, with distribution
$P$ is called reversible if for any $m > n > 0$,

$$P(X_1, X_2, \ldots, X_n) = P(X_{m-1}, X_{m-2}, \ldots, X_{m-n}).$$

It is not difficult to show that reversibility implies stationarity [13]; if 
stationarity is given, the above condition need only be checked for $m = n + 1$.
Now suppose X is an order-r, irreducible Markov chain taking values in 
a finite set X. We will also apply the term reversible to this process when 
the stationary chain satisfies the reversibility condition. 

Proposition 1.2. Let $P$ be the stationary law of the order-$r$ Markov
chain $X$. If $P(X_1, \ldots, X_{r+1}) = P(X_{r+1}, \ldots, X_1)$, then the Markov chain is
reversible.

Proof. It is not difficult to check that the hypothesis together with
stationarity imply $P(X_1, \ldots, X_n) = P(X_n, \ldots, X_1)$ for any $n < r + 1$. For
any $n > r + 1$:

$$
\begin{aligned}
P(X_1, \ldots, X_n) &= P(X_1, \ldots, X_{r+1}) \prod_{i=r+2}^{n} P(X_i \mid X_{i-r}, \ldots, X_{i-1}) \\
&= P(X_1, \ldots, X_{r+1}) \frac{P(X_2, \ldots, X_{r+2})}{P(X_2, \ldots, X_{r+1})} \cdots \frac{P(X_{n-r}, \ldots, X_n)}{P(X_{n-r}, \ldots, X_{n-1})} \\
&= P(X_n, \ldots, X_{n-r}) \frac{P(X_{n-1}, \ldots, X_{n-r-1})}{P(X_{n-1}, \ldots, X_{n-r})} \cdots \frac{P(X_{r+1}, \ldots, X_1)}{P(X_{r+1}, \ldots, X_2)} \\
&= P(X_n, \ldots, X_1),
\end{aligned}
$$

where we have used the Markov property, stationarity and the hypothesis.
□



As a first remark, note that $X_n, n \in \mathbb{N}$, can be represented as a first-order
Markov chain $V_n, n \in \mathbb{N}$, taking values in the space of sequences $\mathcal{X}^r$. However,
the reversibility of $X$ does not imply the reversibility of its first-order
representation; therefore, the analysis of higher-order reversible Markov chains
requires novel techniques. In the following sections, we often use the first-order
representation $V_n, n \in \mathbb{N}$, referring to it nonetheless as an order-$r$ Markov
chain and using the notion of reversibility associated with the order-$r$ Markov
chain.
Fig. 1. A set of weighted circuits on the set $\mathcal{X} = \{a,b,c,d,e,f,g\}$. In a circuit process
started at $u$ in $\mathcal{X}^r$, we transition on some circuit that contains $u$ with probability
proportional to its weight.

Secondly, we recall that Kolmogorov's criterion is another necessary and 
sufficient condition for the reversibility of a Markov chain, which only de- 
pends on the conditional transition probabilities [13]. Its equivalence to Def- 
inition 1.1 in the higher-order case is proven in the Appendix. Kolmogorov's 
criterion requires that the probability of traversing any cycle in either direc- 
tion is the same. Accordingly, a reversible Markov chain can be interpreted 
as a process with no net circulation in space. 
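
For a first-order chain on a small state space, Kolmogorov's criterion can be checked by brute force. The following is a minimal sketch and not part of the original analysis: the function names and the two example matrices are illustrative assumptions. It relies on the fact that comparing the two directions of every simple cycle is enough, since any closed path decomposes into simple cycles.

```python
from itertools import permutations

def cycle_probability(p, cycle):
    """Probability of traversing cycle[0] -> cycle[1] -> ... -> cycle[0]."""
    prob = 1.0
    for a, b in zip(cycle, cycle[1:] + cycle[:1]):
        prob *= p[a][b]
    return prob

def satisfies_kolmogorov(p, tol=1e-12):
    """Compare both directions of every simple cycle of length >= 3."""
    states = range(len(p))
    for length in range(3, len(p) + 1):
        for cycle in permutations(states, length):
            forward = cycle_probability(p, list(cycle))
            backward = cycle_probability(p, list(cycle)[::-1])
            if abs(forward - backward) > tol:
                return False
    return True

# Random walk on a graph with symmetric edge weights (hypothetical example).
weights = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
reversible = [[w / sum(row) for w in row] for row in weights]

# Walk with a preferred direction of rotation, hence net circulation.
rotating = [[0.0, 0.9, 0.1], [0.1, 0.0, 0.9], [0.9, 0.1, 0.0]]
```

The enumeration grows factorially with the number of states, so this is only meant to illustrate the criterion on toy examples.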

Reversibility is preserved under certain transformations. For example, let 
$X_n, n \in \mathbb{N}$, be a stationary, reversible Markov chain and consider a finitely
valued function, $f(X_n), n \in \mathbb{N}$. It is easy to check that this process is stationary
and reversible, even though it may not be a Markov chain of any finite
order. Functions or projections of reversible Markov chains appear under 
different guises in physics and other fields, and in many cases the effects of 
memory subside with time, motivating the use of finite order models. The 
problems of determining the order and estimating the parameters of Markov 
models have been studied extensively; here, we address these problems with 
the constraint of reversibility. 

2. Graphical representations of reversible Markov chains. For any sequence
$u \in \mathcal{X}^s$, let $u^*$ be its inverse, $\Lambda(u)$ the subsequence obtained by deleting
its last element and $\Omega(u)$ the one obtained by deleting its first element.
We call $u_1, u_2, \ldots, u_n$ with $u_i \in \mathcal{X}^s$ an admissible path if $\Omega(u_i) = \Lambda(u_{i+1})$
for all $1 \le i < n$. The concatenation of these sequences without repeated
overlaps is denoted $u_1 \cdots u_n \in \mathcal{X}^{s+n-1}$.

The first representation we will consider is the circuit process of
MacQueen [14]. Let a circuit be a periodic function on $\mathcal{X}$, and consider a class
of positively weighted circuits $\mathcal{C}$ (for an example, see Figure 1).
Fig. 2. A de Bruijn graph of order 2 on the state space $\mathcal{X} = \{a,b,c\}$. In a reversible
random walk, the two highlighted edges have the same weight.

Definition 2.1. A circuit process of order $r$ is a Markov chain of the
same order, where the transition probability from $u \in \mathcal{X}^r$ to any $v \in \mathcal{X}^r$ with
$\Omega(u) = \Lambda(v)$ is given by

$$p(v \mid u) = \frac{\sum_{\gamma \in \mathcal{C}} w_\gamma J_\gamma(uv)}{\sum_{\gamma \in \mathcal{C}} w_\gamma J_\gamma(u)},$$

where $w_\gamma > 0$ is the weight of circuit $\gamma$, and the function $J_\gamma(\cdot)$ counts the
number of times that the circuit traverses a sequence in one period. In other
words, in each step we move along some circuit in the class $\mathcal{C}$ containing the current
state with probability proportional to its weight. The process only visits
states that appear in the circuits, for which transition probabilities are well
defined.

An irreducible order-$r$ Markov chain with stationary law $P_\pi$ is parametrized
by $P_\pi(u)$ for all $u \in \mathcal{X}^{r+1}$. One can check that in a circuit process, this is
just $P_\pi(u) = \sum_{\gamma \in \mathcal{C}} w_\gamma J_\gamma(u)$. MacQueen showed that any order-$r$ Markov
chain can be represented as a circuit process on a finite set of circuits $\mathcal{C}$, which is not
unique [14]. This is true in particular when the chain is reversible.
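
The circuit representation can be made concrete with a short sketch. This is an illustration under stated assumptions rather than MacQueen's construction: circuits are given as strings listing one period, the helper names are hypothetical, and the weights are left unnormalized, which does not affect the transition probabilities of Definition 2.1.

```python
from collections import defaultdict

def traversal_counts(circuit, length):
    """J_gamma(u) for every window u of the given length over one period."""
    counts = defaultdict(int)
    period = len(circuit)
    for start in range(period):
        window = tuple(circuit[(start + i) % period] for i in range(length))
        counts[window] += 1
    return counts

def circuit_weights(circuits, length):
    """Sum over circuits of w_gamma * J_gamma(u), for all u of the given length."""
    total = defaultdict(float)
    for circuit, weight in circuits:
        for u, j in traversal_counts(circuit, length).items():
            total[u] += weight * j
    return total

def transition_probability(circuits, u, v):
    """p(v | u) for states u, v given as tuples; u must appear in some circuit."""
    r = len(u)
    if u[1:] != v[:-1]:
        return 0.0  # the pair u, v is not admissible
    numerator = circuit_weights(circuits, r + 1)[u + v[-1:]]
    return numerator / circuit_weights(circuits, r)[u]

# Hypothetical class of circuits: each circuit is listed with its reversal,
# carrying the same weight, which makes the pair weights symmetric.
circuits = [("abc", 1.0), ("cba", 1.0), ("ab", 0.5), ("ba", 0.5)]
pair_weights = circuit_weights(circuits, 2)
```

Listing every circuit together with its reversal at equal weight is one simple way to obtain weights that are invariant under reversal of $u$, in line with the reversibility condition of Proposition 1.2.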

We introduce a second graphical representation that is canonical, unlike
the circuit process. Consider a de Bruijn graph on the vertices $\mathcal{X}^r$, which has
a directed edge from $u$ to $v$ if and only if $\Omega(u) = \Lambda(v)$. That is, every path
on the graph is an admissible path. For an example, see Figure 2. Assign
a weight $k_{uv} > 0$ to each edge, and let $k_u$ be the summed weights of edges
departing from $u$. Furthermore, require that

(1) $k_{uv} = k_{v^*u^*}$ for every edge $uv$,
(2) $k_u = k_{u^*}$ for all $u \in \mathcal{X}^r$ and
(3) $\sum_{u \in \mathcal{X}^r} k_u = 1$.

Definition 2.2. The reversible random walk of order $r$ is a random walk
on such a graph, with transition probabilities

$$p(v \mid u) = \frac{k_{uv}}{k_u}.$$

Proposition 2.3. An irreducible, reversible random walk of order r 
represents a reversible Markov chain of the same order. Every irreducible, 
reversible order-r Markov chain is equivalent to a unique reversible random 
walk of order r. 

Proof. Let $\pi$ be the stationary distribution of the random walk. To
prove the first statement, we will first verify that $\pi(u) = k_u$ for all $u \in \mathcal{X}^r$.
Let $p(u \mid v)$ be the transition probability from $v$ to $u$ in the random walk, and
recall that $\Omega(u) = \Lambda(v)$ iff $\Omega(v^*) = \Lambda(u^*)$, then

$$
\begin{aligned}
\sum_{u \in \mathcal{X}^r} \pi(u)\, p(v \mid u) &= \sum_{\{u \in \mathcal{X}^r : \Omega(u) = \Lambda(v)\}} k_u \frac{k_{uv}}{k_u} \\
&= \sum_{\{u \in \mathcal{X}^r : \Omega(v^*) = \Lambda(u^*)\}} k_{v^*u^*} = k_{v^*} = k_v = \pi(v).
\end{aligned}
$$

Then, the stationary law in the random walk of a path $u, v$ is just

$$P_\pi(u, v) = \pi(u)\, p(v \mid u) = k_u \frac{k_{uv}}{k_u} = k_{uv},$$

which implies that $P_\pi(u, v) = k_{uv} = k_{v^*u^*} = P_\pi(v^*, u^*)$. Therefore, the
$\mathcal{X}$-valued, order-$r$ Markov chain represented by the random walk satisfies the
reversibility condition in Proposition 1.2. Proving the second statement is
now straightforward. Let $V_n, n \in \mathbb{N}$, be the first-order representation of an
irreducible, order-$r$ Markov chain, with transition probabilities $p'(v \mid u)$. By
the Perron-Frobenius theorem, $V$ has a unique stationary distribution $\pi'$.
Assign edge weights to the de Bruijn graph on $\mathcal{X}^r$, setting $k_{uv} = \pi'(u)\, p'(v \mid u)$.
Since the order-$r$ Markov chain is reversible, it follows directly from
Proposition 1.2 that the edge weights satisfy conditions (1)-(3). □
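
Conditions (1)-(3) and the stationarity of $k_u$ shown in the proof are easy to verify numerically on a small example. In the sketch below, an edge $uv$ is identified with the concatenated sequence in $\mathcal{X}^{r+1}$, so that the edge $v^*u^*$ is simply the reversed tuple; the function name and the two example weightings are assumptions made for illustration.

```python
from itertools import product

def reversible_walk_checks(k, r, tol=1e-12):
    """Check conditions (1)-(3) and stationarity for edge weights k on (r+1)-tuples."""
    out_weight = {}  # k_u: total weight of edges departing from u
    in_weight = {}   # total weight of edges arriving at v
    for edge, weight in k.items():
        out_weight[edge[:r]] = out_weight.get(edge[:r], 0.0) + weight
        in_weight[edge[1:]] = in_weight.get(edge[1:], 0.0) + weight
    cond1 = all(abs(wt - k.get(e[::-1], 0.0)) <= tol for e, wt in k.items())
    cond2 = all(abs(wt - out_weight.get(u[::-1], 0.0)) <= tol
                for u, wt in out_weight.items())
    cond3 = abs(sum(out_weight.values()) - 1.0) <= tol
    # pi(u) = k_u is stationary when the weight flowing into u equals k_u.
    stationary = all(abs(in_weight.get(u, 0.0) - wt) <= tol
                     for u, wt in out_weight.items())
    return cond1, cond2, cond3, stationary

# Uniform weights on the order-2 de Bruijn graph over {a, b}.
uniform = {e: 1 / 8 for e in product("ab", repeat=3)}

# A weighting that violates condition (1): edge aab and its reversal baa differ.
skewed = {("a", "a", "b"): 0.5, ("a", "b", "a"): 0.25, ("b", "a", "a"): 0.25}
```

Edges that are absent from the dictionary are treated as having weight zero, which is a convenience of this sketch and not a convention taken from the text.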

3. From a reinforced random walk to the conjugate prior. An edge-reinforced
random walk (ERRW) is a random walk on a finite, undirected
graph, where every edge weight is increased by 1 each time it is crossed. Since
Diaconis and Coppersmith defined this process [9], we have learned that it
is partially exchangeable and, by de Finetti's theorem for Markov chains,
a mixture of Markov chains [8]. The mixing measure, which lives on the space
of reversible Markov chains, was more recently characterized in the literature 
[12]. Diaconis and Rolles showed that this distribution is a conjugate prior 
for the reversible Markov chain, much as the Beta distribution, arising from 
a Polya urn scheme, is a conjugate prior for sequences of i.i.d. binary random 
variables [10]. 

Here, we construct a conjugate prior for higher-order reversible Markov
chains via a reinforced random walk in $\mathcal{X}^r$, making use of de Finetti's theorem
for Markov chains. This process is markedly different from an ERRW in
$\mathcal{X}^r$ due to the structure of a reversible Markov chain with memory, although
it is designed to be partially exchangeable.

Let $a$ be any sequence on $\mathcal{X}$ and $v$ a sequence shorter than $a$. Define the
function $J'_a(v)$, which counts the number of times that $v$ appears in $a$, and
$J''_a(v)$, which counts the number of times that $v$ appears in $a$ followed by at
least one state. Fix $w$, a stationary measure for an irreducible, reversible,
order-$r$ Markov chain. Also fix $v_0 \in \mathcal{X}^r$. Let $\beta$ be a palindromic sequence
that starts with $v_0$ and ends with $v_0^*$. Choose a positive constant $c$, such
that for all $u \in \mathcal{X}^{r+1}$, $w(u) - cJ'_\beta(u) > 0$. Now, given a sequence $\eta$, starting
with $v_0$, and any sequence $v$, define the functions

$$w'(\eta, v) = w(v) + c\,(J'_\eta(v) + J'_{\eta^*}(v) - J'_\beta(v)) \quad \text{and} \tag{4}$$

$$w''(\eta, v) = w(v) + c\,(J''_\eta(v) + J''_{\eta^*}(v) - J''_\beta(v)). \tag{5}$$

When $\eta$ represents the path of a stochastic process in $\mathcal{X}^r$ up to time $n$
(formally, $\eta = v_0 \cdots v_n$), we will use the notation $w'_n(v) = w'(\eta, v)$ and
$w''_n(v) = w''(\eta, v)$.

Definition 3.1. The reinforced random walk of order $r$ is a stochastic
process $Y_n, n \in \mathbb{N}$, on $\mathcal{X}^r$ with distribution $Q_{w,v_0}$. The initial state is $v_0$ with
probability 1. For any admissible path $v_0, \ldots, v_n$, the conditional transition
probability

$$Q_{w,v_0}(Y_{n+1} = u \mid Y_0 = v_0, \ldots, Y_n = v_n) = \frac{w'_n(v_n u)}{w''_n(v_n)}$$

whenever $v_n, u$ is admissible and zero otherwise.

Remark 3.2. The law $Q_{w,v_0}$ also depends on $\beta$ and $c$. These parameters
are constant in the following discussion, so they are omitted from the
notation for conciseness. When $r = 1$, this process is equivalent to an ERRW.
In this case, the palindrome is unnecessary because the terms involving $\beta$
in $w'$ and $w''$ can be modeled with a different $w$. For $r \ge 2$, this is not the
case, and $\beta$ is essential for partial exchangeability (see Proposition 3.5).
Fig. 3. Auxiliary sequences in the order-$r$ reinforced random walk.

Remark 3.3. This process admits an interpretation as a reinforcement
scheme of the circuit process. Consider a circuit process of order $r$ with
stationary probability $w(u) = \sum_{\gamma \in \mathcal{C}} w_\gamma J_\gamma(u)$ for all $u \in \mathcal{X}^{r+1}$. In addition,
consider three weighted sequences: the palindrome $\beta$, a sequence $\eta$ that
represents the path of the reinforced process from the initial state $v_0$ up to
the current state, and the reversed path $\eta^*$. These are depicted in Figure 3
along with their weights $-c$, $c$ and $c$, respectively. As in the circuit process,
we move along any circuit or sequence that contains the current state with
probability proportional to its weight. The reinforcement is accomplished by
elongating the paths $\eta$ and $\eta^*$.

Remark 3.4. The process is also a reinforcement scheme of a modified
reversible random walk of order $r$. Consider a weighted de Bruijn graph,
where for every admissible $u, v$, $k_{uv} = w(uv)$. Then, for every $uv$ in the
palindrome $\beta$, subtract $c$ from $k_{uv}$. The reinforcement scheme will consist of
a random walk on the resulting graph, where after every transition $v_i \to v_{i+1}$
we increase both $k_{v_i v_{i+1}}$ and $k_{v_{i+1}^* v_i^*}$ by $c$. Accordingly, if $v_i v_{i+1}$ is a
palindrome, the weight $k_{v_i v_{i+1}}$ is increased by $2c$.
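
The reinforcement scheme can be simulated directly from equation (4). The sketch below is a minimal illustration, not the author's implementation: the function names, the uniform choice of $w$, the palindrome "aba" and the value of $c$ are assumptions. The path is stored as a sequence of symbols, the current state is its last $r$ symbols, and the denominator is obtained by normalizing the edge weights out of the current state, which agrees with $w''_n(v_n)$ provided $w$ on $\mathcal{X}^r$ is the marginal of $w$ on $\mathcal{X}^{r+1}$.

```python
import random
from collections import Counter
from itertools import product

def window_counts(seq, length):
    """Number of appearances of every window of the given length in seq."""
    return Counter(tuple(seq[i:i + length]) for i in range(len(seq) - length + 1))

def reinforced_random_walk(w, v0, beta, c, steps, alphabet, rng):
    """Sample a symbol path; w maps (r+1)-tuples to their initial weights."""
    r = len(v0)
    eta = list(v0)
    j_beta = window_counts(beta, r + 1)
    for _ in range(steps):
        v = tuple(eta[-r:])
        j_eta = window_counts(eta, r + 1)        # counts along the path
        j_rev = window_counts(eta[::-1], r + 1)  # counts along the reversed path
        # Equation (4) evaluated on each edge leaving the current state v.
        edge_weights = [
            w[v + (x,)] + c * (j_eta[v + (x,)] + j_rev[v + (x,)] - j_beta[v + (x,)])
            for x in alphabet
        ]
        eta.append(rng.choices(alphabet, weights=edge_weights)[0])
    return eta

alphabet = ["a", "b"]
w = {u: 1 / 8 for u in product(alphabet, repeat=3)}  # uniform, hence reversible
# c = 0.05 keeps w(u) - c * J'_beta(u) positive for the palindrome "aba".
path = reinforced_random_walk(w, v0=("a", "b"), beta=list("aba"), c=0.05,
                              steps=200, alphabet=alphabet, rng=random.Random(0))
```

Recounting the windows at every step keeps the code close to the definition at the cost of quadratic running time; an incremental update of the two counters after each transition would be the natural refinement.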

Proposition 3.5. The reinforced random walk of order r is partially ex- 
changeable in the sense of Diaconis and Freedman [8]. 

Proof. We must show that the probability $Q_{w,v_0}(v_0, \ldots, v_n)$ of any
admissible path $v_0, \ldots, v_n$ is a function of the initial state $v_0$ and the transition
counts between every pair of states. For any pair $u, v$ in $\mathcal{X}^r$ with $\Lambda(v) = \Omega(u)$,
let $C(u, v)$ be the total number of transitions $u \to v$, and $v^* \to u^*$. We will
show the stronger statement that $v_0$ and $C$ are sufficient statistics for the
reinforced random walk.

Let us first establish some properties that are conserved in the process. For
every $u \in \mathcal{X}^{r+1}$, the initial weights $w'_0(u)$ and $w'_0(u^*)$ are equal. This is direct
from the definition in equation (4) because: $w$ defines a reversible Markov
chain of order $r$; the functions $J'_\eta$ and $J'_{\eta^*}$ are zero for both $u$ and $u^*$; and $\beta$
is a palindrome, so if it contains $u$, it also contains $u^*$, and $J'_\beta(u) = J'_\beta(u^*)$.
This property is maintained after every transition $v_n \to v_{n+1}$, because the
weights may both be increased by $c$ if $v_n v_{n+1}$ is $u$ or $u^*$, or both remain
constant otherwise.

For every $v \ne v_0$ in $\mathcal{X}^r$, the initial weights $w''_0(v) = w''_0(v^*)$. This is direct
from equation (5) because: $w$ is reversible, both $J''_\eta$ and $J''_{\eta^*}$ are zero for $v$
and $v^*$, and the sequence $\beta$ is a palindrome, so for every transition starting at
$v$ there will be another starting from $v^*$. The last fact is not necessarily true
for $v_0$ because unless $v_0$ itself is a palindrome, $\beta$ will contain a transition
starting from it, but no transition starting from $v_0^*$. So, in the beginning,
$w''_0(v_0) = w''_0(v_0^*) - c$. When a transition occurs from $v_0$ to $v_1$, the weights
$w''_1(v_0)$ and $w''_1(v_0^*)$ become equal, while $w''_1(v_1) = w''_1(v_1^*) - c$, provided $v_1$
is not a palindrome. Hence, this singularity is preserved by the last state
visited by the process.

The probability $Q_{w,v_0}(v_0, \ldots, v_n)$ is a ratio of two products. In the
numerator, we find a factor of the form $w'_t(uv)$ for every admissible transition
$u \to v$, while in the denominator, we find a corresponding weight $w''_t(u)$. It
is easy to check that the numerator is only a function of $C$. Every
transition $u \to v$ or $v^* \to u^*$ adds a new factor of $w'_t(uv)$, which is always greater
than the previous one by $c$. If $uv$ is a palindrome, then every new factor
of $w'_t(uv)$ is increased by $2c$. So, the numerator can be computed from the
initial weights and $C$.

We have left to show that the denominator is only dependent on $v_0$ and
$C$. Note that the transition counts from $v$ or $v^*$ are a function of $C$ and $v_0$,
because every event $v \to u$ is a transition from $v$, while every event $u^* \to v^*$
is followed by a transition from $v^*$, unless this is the final state, which is
determined by $v_0$. After every transition from $v$ or $v^*$, we add a factor of
$w''_t(v)$ or $w''_t(v^*)$ to the denominator. At any time $t$, these weights differ by
$c$ (if $v$ is not a palindrome), but the factor added is always the smaller of
the two. Between two transitions, each of these weights is reinforced by $c$, so
consecutive factors differ by that amount. If $v$ is a palindrome, there is no
distinction between $w''_t(v)$ and $w''_t(v^*)$, and consecutive factors differ by $2c$.
□

Lemma 3.6. Suppose that in the reinforced random walk, we visit $v$ and
$v^*$ in $\mathcal{X}^r$ infinitely often a.s., and let $\tau_n$ be the $n$th time we visit either state.
The process $Y_{\tau_n}$ is a mixture of Markov chains. Furthermore, if $D_n$ is the
ratio of the number of visits to $v^*$ and $v$ by $\tau_n$, $D_n$ converges a.s. to a finite
limit $D_\infty$.

Proof. We claim that if $Y_n$ is partially exchangeable, so is $Y_{\tau_n}$. It is
sufficient to show that the probability of a sequence $Y_{\tau_n}$ is invariant upon
block transpositions, which generate the group of permutations that preserve
transition counts ([8], Proposition 27). The probability of a path $v_{\tau_1}, \ldots, v_{\tau_n}$
in $Y_{\tau_n}$ is the sum of the probabilities of all paths $v_0, v_1, \ldots, v_{\tau_n}$ in $Y_n$ that
map to it. Denote this set of paths $G$. After a transposition of $v$-blocks or
$v^*$-blocks, the probability of the path in $Y_{\tau_n}$ is equal to the sum of the
probabilities of a different set of paths $G'$ in $Y_n$. However, it is easy to see that
this transposition of $v$-blocks or $v^*$-blocks defines a bijection from $G$ to $G'$,
and the probability of each path and its transposition is the same, because
$Y_n$ is partially exchangeable. Therefore, $Y_{\tau_n}$ is partially exchangeable.
Furthermore, we assume that $v$ and $v^*$ are recurrent, so by de Finetti's theorem
for Markov chains $Y_{\tau_n}$ is a mixture of Markov chains with a unique measure
$\mu$ on the space of 2 by 2 transition matrices [8]. Note that both states are
recurrent with probability 1, so the subset of transition matrices where one
of the states is transient has $\mu$-measure zero. This implies that $\mu$-a.s. the
transition matrix is irreducible, and since the state space is finite, both states
are positive-recurrent. Therefore, $D_n$ converges a.s. to a finite limit. □

Proposition 3.7. The reinforced random walk of order $r$ traverses every
edge $u \to v$ with $w(uv) > 0$ infinitely often, almost surely.

Proof. As $\mathcal{X}$ is finite, we must visit at least one state in $\mathcal{X}^r$ infinitely
often, so without loss of generality, let this state be $v$. Let $\tau_n$ be the $n$th
time we visit $v$, and $\mathcal{F}_n$ be $\sigma(Y_1, \ldots, Y_{\tau_n})$. For $u$ with $v, u$ admissible and
$w(vu) > 0$, let $A_n$ be the event that $Y_{\tau_n + 1} = u$. Also, let $p_n = Q_{w,v_0}(A_n \mid \mathcal{F}_n)$.
By Levy's extension of the Borel-Cantelli lemma (Lemma A.2),

$$\lim_{n \to \infty} \frac{\sum_{m=1}^{n} 1_{A_m}}{\sum_{m=1}^{n} p_m} = 1 \quad \text{on } \Big\{ \sum_m p_m = \infty \Big\}.$$

Therefore, to show that the transition $v \to u$ is observed infinitely often
with probability 1, it is sufficient to show that $\sum_m p_m = \infty$ a.s. The
conditional probability $p_m$ is just $w'_{\tau_m}(vu) / w''_{\tau_m}(v)$. Let $B_{m,k}$ be the event that
we observe $v^*$ fewer than $km$ times between $\tau_1$ and $\tau_m$. On $B_{m,k}$, we can
lower-bound $p_m$ using the minimum possible value of $w'_{\tau_m}(vu)$, which is its
initial value, and the maximum possible value of $w''_{\tau_m}(v)$, which is $(k+1)mc$.
Thus,

$$p_m = Q_{w,v_0}(A_m \cap B_{m,k} \mid \mathcal{F}_m) + Q_{w,v_0}(A_m \cap B_{m,k}^c \mid \mathcal{F}_m) \ge \frac{w'_0(vu)}{(k+1)mc}\, 1_{B_{m,k}}.$$

Now, consider the event $\{D_\infty < N\}$. On this set, for any $k > N$, we will
be in $B_{m,k}$ for all but finitely many $m$, which implies $\sum_m p_m = \infty$ by the
previous inequality. But, by Lemma 3.6 we have $Q_{w,v_0}\{D_\infty < \infty\} = 1$, so
noting $\{D_\infty < \infty\} = \bigcup_{N \in \mathbb{N}} \{D_\infty < N\}$ we conclude that $\sum_m p_m = \infty$ $Q_{w,v_0}$-a.s.,
and $A_m$ happens infinitely often. Since $w$ defines an irreducible Markov
chain, the proposition follows by induction. □

Propositions 3.7 and 3.5 are sufficient to show by de Finetti's theorem for
Markov chains [8] that the reinforced random walk of order $r$ is a mixture
of Markov chains on $\mathcal{X}^r$, or

$$Q_{w,v_0}(v_0, \ldots, v_n) = \int_{\mathcal{T}} P^T_{v_0}(v_0, \ldots, v_n)\, d\phi_{w,v_0}(T), \tag{6}$$

where $P^T_{v_0}$ is the distribution of a Markov chain started at $v_0$ and parametrized
by the matrix $T$, $\mathcal{T}$ is the space of $\mathcal{X}^r \times \mathcal{X}^r$ stochastic matrices and
$\phi_{w,v_0}$ is a unique measure on the Borel subsets of this space. Let $\mathcal{T}' \subset \mathcal{T}$ be
the set of matrices that represent irreducible, reversible Markov chains of
order $r$.

Proposition 3.8. The reinforced random walk of order $r$ is a mixture
of reversible Markov chains of the same order, or $\phi_{w,v_0}(\mathcal{T}') = 1$.

Proof. This is a special case of Proposition 4.6. □ 

4. Variable-order, reversible Markov chains. The number of parameters 
of a Markov chain grows as $|\mathcal{X}|^r$ with the order, $r$, which renders higher-order
models impractical in many statistical applications. In this section, we
investigate a family of models with finite memory length which do not suffer 
from this curse of dimensionality. 
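
To give a sense of scale, the short sketch below counts the free transition probabilities of a general (not necessarily reversible) order-$r$ chain: there are $|\mathcal{X}|^r$ conditioning states, each with $|\mathcal{X}| - 1$ free probabilities. The function name and the four-letter alphabet are illustrative assumptions.

```python
def free_parameters(num_symbols, order):
    """Free transition probabilities of a general order-r chain."""
    return num_symbols ** order * (num_symbols - 1)

# Parameter counts for a four-symbol alphabet and orders 1 through 5.
counts = {r: free_parameters(4, r) for r in range(1, 6)}
```

Each additional unit of memory multiplies the count by the alphabet size, which is the growth that variable-order models are meant to avoid.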

Definition 4.1. A variable-order Markov chain is a Markov chain of
order $r$ with the constraint that for every history $h$ in the set $\mathcal{H} \subset \{v \in
\mathcal{X}^q : q < r\}$, if two states $u, u' \in \mathcal{X}^r$ both end in $h$, the transition probabilities
$p(v \mid u)$ and $p(v \mid u')$ are equal for every $v \in \mathcal{X}^r$.

In essence, this is a discrete process which upon reaching a sequence
$h \in \mathcal{H}$ loses memory of what preceded it. When $\mathcal{H}$ is empty, we recover
a general Markov chain of order $r$. Variable-order Markov chains have proven
useful in applications where there is long memory only in certain directions.
The literature on the subject can be traced to Rissanen [15] and Weinberger
[17], who developed tree-based algorithms for estimating the set of histories
efficiently in the context of compression. Bühlmann and Wyner proved
several consistency results on these algorithms [7], and the former later
addressed the problem of model selection [6]. For an evaluation of different
algorithms in applications, see [4].
It is worth noting that MacQueen mentioned variable-order Markov chains
in an unpublished abstract. However, there is a marked difference between
his definition and Bühlmann and Wyner's, which relates to the closure
properties of $\mathcal{H}$. MacQueen requires that if $h$ is in $\mathcal{H}$, then so are all the
sequences that begin with $h$. Intuitively, this means that the process cannot
recover memory once it is lost. Bühlmann and Wyner do not impose this
constraint. However, this is guaranteed when the process is reversible.



Proposition 4.2. Let $X_n, n \in \mathbb{N}$, be an irreducible, reversible,
variable-order Markov chain with histories $\mathcal{H}$. If $h \in \mathcal{H}$, then $h^*$ is also a history;
additionally, any sequence that has $h$ as a prefix is also in $\mathcal{H}$.



Proof. Let $P_\pi$ be the stationary law of the chain. If $h \in \mathcal{H}$, then for
any pair $a, b \in \mathcal{X}^q$, where $q$ and the length of $h$ sum to $r$,
$P_\pi(X_1, \ldots, X_{r+q} = ahb \mid X_1, \ldots, X_r = ah)$ is independent of $a$, or

$$\frac{P_\pi(ahb)}{P_\pi(ah)} = C.$$

This implies

$$\frac{P_\pi(hb)}{P_\pi(h)} = \frac{\sum_{a \in \mathcal{X}^q} P_\pi(ahb)}{\sum_{a \in \mathcal{X}^q} P_\pi(ah)} = \frac{\sum_{a \in \mathcal{X}^q} P_\pi(ah)\, C}{\sum_{a \in \mathcal{X}^q} P_\pi(ah)} = \frac{P_\pi(ahb)}{P_\pi(ah)}.$$

Using the fact that $P_\pi$ is invariant upon time reversal and rearranging
factors, we obtain

$$\frac{P_\pi(b^*h^*a^*)}{P_\pi(b^*h^*)} = \frac{P_\pi(h^*a^*)}{P_\pi(h^*)}.$$

The left-hand side is equal to $P_\pi(X_1, \ldots, X_{r+q} = b^*h^*a^* \mid X_1, \ldots, X_r = b^*h^*)$,
which by the previous identity is independent of $b^*$. As this is true for any
$a \in \mathcal{X}^q$, $h^*$ must be a history in $\mathcal{H}$. To prove the second part of the statement,
suppose $h$ is a prefix of $g$. Since $h^*$ is in $\mathcal{H}$, and $g^*$ ends in $h^*$, then by
definition $g^* \in \mathcal{H}$. Using the first result, we conclude that $g \in \mathcal{H}$. □



We will define a reinforcement scheme, which like the one in the previous
section is recurrent, partially exchangeable and, by de Finetti's theorem,
a mixture of Markov chains. But, in this case, the mixing measure is
restricted to the variable-order, reversible Markov chains with a fixed set of
histories $\mathcal{H}$. As before, we begin with a stationary, reversible function $w$,
an initial state $v_0 \in \mathcal{X}^r$, and a palindromic sequence $\beta$ that starts with $v_0$.
Let the function $f : \mathcal{X}^r \mapsto \mathcal{H}$ map any sequence to its shortest ending in $\mathcal{H}$.
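
The map $f$ amounts to a suffix lookup. The sketch below is one possible reading: states and histories are tuples of symbols, and a state with no ending in $\mathcal{H}$ is returned unchanged, which is an assumption of this illustration (the text does not spell out that case). The function name and the example set of histories are hypothetical.

```python
def shortest_history_suffix(state, histories):
    """Return the shortest ending of state that belongs to histories."""
    for length in range(1, len(state) + 1):
        suffix = tuple(state[-length:])
        if suffix in histories:
            return suffix
    return tuple(state)  # assumed fallback: the full state keeps all its memory

# Hypothetical histories for an order-3 chain on {a, b, c}.
histories = {("a",), ("b", "c")}
```

With these histories, any state ending in the symbol a forgets everything before that symbol, while a state ending in b, c forgets only its first symbol.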
Definition 4.3. The variable-order, reinforced random walk is a stochastic
process $Z_n, n \in \mathbb{N}$, on $\mathcal{X}^r$ with measure $H_{w,v_0}$. The initial state is $v_0$ with
probability 1. For any admissible path $v_0, \ldots, v_n$, the conditional transition
probability

$$H_{w,v_0}(Z_{n+1} = u \mid Z_0 = v_0, \ldots, Z_n = v_n) = \frac{w'_n(f(v_n)u)}{w''_n(f(v_n))}$$

whenever $v_n, u$ is admissible and zero otherwise.



Remark 4.4. This process is a reinforced circuit process, just like the
one defined in Remark 3.3, with the difference that in computing the
transition probabilities, instead of taking the current state to be the sequence
$v_n \in \mathcal{X}^r$, we let it be the shortest ending of $v_n$ in $\mathcal{H}$ or $f(v_n)$.



Proposition 4.5. The variable-order, reinforced random walk is
partially exchangeable in the sense of Diaconis and Freedman.



This proof is deferred to the Appendix. One can show that this process
is recurrent following the same argument of Proposition 3.7. In the proof of
Proposition 3.7, we use a shortest history $h$ in place of $v$, and Lemma 3.6
still holds for $h$ and $h^*$. Recurrence and partial exchangeability imply

$$H_{w,v_0}(v_0, \ldots, v_n) = \int_{\mathcal{T}} P^T_{v_0}(v_0, \ldots, v_n)\, d\psi_{w,v_0}(T) \tag{7}$$

for a unique measure $\psi_{w,v_0}$ characterized by the function $w$, and the initial
state, in addition to the parameters $\beta$, $c$ and $\mathcal{H}$ which we keep fixed. In the
Appendix, we show that $\psi_{w,v_0}$ is restricted to the reversible, variable-order
Markov chains with histories $\mathcal{H}$.



Proposition 4.6. Let $\mathcal{T}''$ be the set of transition matrices
representing an irreducible, reversible, variable-order Markov chain where every
$h \in \mathcal{H}$ is a history. Then, $\psi_{w,v_0}(\mathcal{T}'') = 1$.



5. Bayesian analysis. In Section 3, we defined a family of measures in 
the space of order-r, reversible Markov chains, and in Section 4 we extended 
it to variable-order, reversible Markov chains. In the following, we will show 
that these distributions are conjugate priors for a Markov chain of order r. 
We discuss properties of the prior relevant to Bayesian analysis, such as 
a natural sampling algorithm and closed-form expressions for some impor- 
tant moments. 






Definition 5.1. Consider a variable-order, reinforced random walk
$Z_n, n \in \mathbb{N}$, with distribution $H_{w,v_0}$ and take any admissible path $e = v_0, \ldots, v_n$.
We define $Z^{(e)}_m, m \in \mathbb{N}$, to be the process with law

$$H_{w,v_0,e}(u_1, \ldots, u_m) = H_{w,v_0}(Z_{n+1} = u_1, \ldots, Z_{n+m} = u_m \mid Z_1 = v_1, \ldots, Z_n = v_n).$$

In words, $Z^{(e)}$ is the continuation of a variable-order reinforced random
walk after traversing some fixed path $e$. We can rewrite the law

$$H_{w,v_0,e}(u_1, \ldots, u_m) = \frac{H_{w,v_0}(v_1, \ldots, v_n, u_1, \ldots, u_m)}{H_{w,v_0}(v_1, \ldots, v_n)}, \tag{8}$$

which makes it evident that $Z^{(e)}$ is partially exchangeable, because for a
fixed $e$, the numerator only depends on the transition counts in $u_1, \ldots, u_m$,
while the denominator is constant. It is also not hard to see that the
process visits every state infinitely often with probability 1. Therefore, by de
Finetti's theorem for Markov chains, it is a mixture of Markov chains with
a mixing measure that will be denoted $\psi_{w,v_0,e}$.

Proposition 5.2. Suppose we model a process $W_n, n \in \mathbb{N}$, as a
reversible, variable-order Markov chain with histories $\mathcal{H} \subset \{v \in \mathcal{X}^q : q < r\}$,
and we assign a prior $\psi_{w,v_0}$ to the transition probabilities, $T$. Given an
observed path, $e = v_0, \ldots, v_n$, the posterior probability of $T$ is $\psi_{w,v_0,e}$. In
consequence, the family of measures

$$\{\psi_{w,v_0,e} : e \text{ an admissible path starting in } v_0\}$$

is closed under sampling.

Proof. Consider the event $W_n = v_n, W_{n+1} = u_1, \ldots, W_{n+m} = u_m$. By
Bayes rule, the posterior probability of this event given the observation is
the prior probability of $W_1 = v_1, \ldots, W_n = v_n, W_{n+1} = u_1, \ldots, W_{n+m} = u_m$
divided by the prior probability of $W_1 = v_1, \ldots, W_n = v_n$. By equation (8),
this posterior is equal to $H_{w,v_0,e}$. Let $\rho(T)$ be the posterior distribution of $T$
given the observation, then for any $u_1, \ldots, u_m$ and any $m > 0$,

$$H_{w,v_0,e}(u_1, \ldots, u_m) = \int_{\mathcal{T}} P^T_{v_n}(v_n, u_1, \ldots, u_m)\, d\rho(T).$$

By de Finetti's theorem for Markov chains, the mixing measure $\psi_{w,v_0,e}$ is
unique; therefore, we must have $\rho = \psi_{w,v_0,e}$. □

In the next proposition, we show that the variable-order, reinforced
random walk may be used to simulate from the conjugate prior $\psi_{w,v_0}$ (or using
a similar argument, a posterior of the form $\psi_{w,v_0,e}$). Let
$\{Y^{(i)}_n\}_{i \in \{1,\ldots,k\}}$ be independent samples of the reinforced random walk with
initial parameters $w$ and $v_0$. For any sequence $u \in \mathcal{X}^{r+1}$, consider the
random variable $n^{-1}w'_n(u)$, the weight defined in equation (4) for a sample
path with distribution $H_{w,v_0}$, normalized by the path's length. Define the
empirical estimate, $n^{-1}w'_{n,k}(u)$, to be the mean of this random variable
evaluated at the paths $\{Y^{(i)}_n\}_{i \in \{1,\ldots,k\}}$. Also, let $P_T$ be the stationary law of an
order-$r$ Markov chain with transition probabilities $T$. We have seen that
$\{P_T(u) : u \in \mathcal{X}^{r+1}\}$ has a one-to-one correspondence with $T$.

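Proposition 5.3. For any bounded, real-valued function $g(P_T(\cdot))$,

$$\lim_{n \to \infty} \lim_{k \to \infty} g(\{n^{-1}w'_{n,k}(u) : u \in \mathcal{X}^{r+1}\}) \overset{\text{a.s.}}{=} \int g(P_T)\, d\psi_{w,v_0}(T). \tag{9}$$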

Proof. The empirical estimate $g(\{n^{-1}w'_{n,k}(u) : u \in \mathcal{X}^{r+1}\})$ is the
average of i.i.d. observations, so by the strong law of large numbers, w.p.1,

$$\lim_{k \to \infty} g(\{n^{-1}w'_{n,k}(u) : u \in \mathcal{X}^{r+1}\}) = H_{w,v_0}[g(\{n^{-1}w'_n(u) : u \in \mathcal{X}^{r+1}\})],$$

where the right-hand side is the expectation in a reinforced random walk
with parameters $w, v_0$. In the proof of Proposition 4.6, we showed that $n^{-1}w'_n(u)$
converges $H_{w,v_0}$-a.s. Taking the limit as $n \to \infty$, by dominated convergence,

$$\lim_{n \to \infty} \lim_{k \to \infty} g(\{n^{-1}w'_{n,k}(u) : u \in \mathcal{X}^{r+1}\}) = H_{w,v_0}\Big[\lim_{n \to \infty} g(\{n^{-1}w'_n(u) : u \in \mathcal{X}^{r+1}\})\Big].$$

Conditional on a variable $T$ measurable on its tail $\sigma$-field with distribution
$\psi_{w,v_0}$, the reinforced random walk is a Markov chain with law $P^T_{v_0}$. We know
$n^{-1}w'_n(u)$ converges $P^T_{v_0}$-a.s. to $P_T(u)$, so equation (9) follows. □



Several moments of $\psi_{w,v_0}$ have closed-form expressions. In particular, the
mean likelihood of any path beginning in $v_0$ is just the probability of
the path in the reinforced random walk by equation (7). From the proof of
Proposition 4.5, one can deduce a closed-form expression for the law of the
variable-order reinforced random walk as a function of the transition counts
in a path (see Supplement [2]). From a realization of the transition counts
as a path, one can also compute the law $H_{w,v_0}$ by modeling a random walk
with reinforcement.

The expectation of cycle probabilities with a prior $\psi_{w,v_0}$ may also
be computed exactly.
Proposition 5.4. For any cyclic path $v, v_1, \ldots, v_n, v$, not necessarily
including $v_0$, the expectation of $P^T_v(v, v_1, \ldots, v_n, v)$ with prior $\psi_{w,v_0}$ on $T$ has
a closed-form expression, provided $w'_0(u)$ is greater than $3c$ for all $u \in \mathcal{X}^{r+1}$.

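Proof. Find the shortest cycle $v, \ldots, v_0, \ldots, v$ with positive weight $w$.
Then, for any transition matrix $T$ in the support of $\psi_{w,v_0}$, we have

$$P^T_v(v, v_1, \ldots, v_n, v) = \frac{P^T_{v_0}(v, v_1, \ldots, v_n, v, \ldots, v_0, \ldots, v)}{P^T_{v_0}(v, \ldots, v_0, \ldots, v)}. \tag{10}$$

Taking the expectation with prior $\psi_{w,v_0}$,

$$\int_{\mathcal{T}} P^T_v(v, v_1, \ldots, v_n, v)\, d\psi_{w,v_0}(T) = \int_{\mathcal{T}} \frac{P^T_{v_0}(v, v_1, \ldots, v_n, v, \ldots, v_0, \ldots, v)}{P^T_{v_0}(v, \ldots, v_0, \ldots, v)}\, d\psi_{w,v_0}(T).$$

By Bayes theorem, the product of the likelihood $P^T_{v_0}(v, v_1, \ldots, v_n, v, \ldots, v_0,
\ldots, v)$ and the prior $d\psi_{w,v_0}(T)$ is equal to the marginal prior probability of
the path $v, v_1, \ldots, v_n, v, \ldots, v_0, \ldots, v$ times the posterior of $T$:

$$\int_{\mathcal{T}} P^T_v(v, v_1, \ldots, v_n, v)\, d\psi_{w,v_0}(T) = H_{w,v_0}(v, v_1, \ldots, v, \ldots, v_0, \ldots, v) \int_{\mathcal{T}} \frac{1}{P^T_{v_0}(v, \ldots, v_0, \ldots, v)}\, d\psi_{w_p,v_0}(T),$$

where $w_p$ are the weights parametrizing the posterior of $T$ given the path
$v, v_1, \ldots, v, \ldots, v_0, \ldots, v$. To solve the integral on the right-hand side, let us
rewrite it using Bayes theorem and equation (7),

$$H^{-1}_{w_{pp},v_0}(v, \ldots, v_0, \ldots, v, \ldots, v_0, \ldots, v) \int_{\mathcal{T}} \frac{P^T_{v_0}(v, \ldots, v_0, \ldots, v, \ldots, v_0, \ldots, v)}{P^T_{v_0}(v, \ldots, v_0, \ldots, v)}\, d\psi_{w_{pp},v_0}(T),$$

where $w_{pp}$ are the weights $w_p$ reduced by the cycle $v, \ldots, v_0, \ldots, v, \ldots, v_0, \ldots, v$.
These weights are positive because of the assumption $w'_0(u) > 3c$ for all $u$,
which could certainly be relaxed in some cases. Applying equations (7) and
(10) once more, the last expression becomes

$$H^{-1}_{w_{pp},v_0}(v, \ldots, v_0, \ldots, v, \ldots, v_0, \ldots, v)\, H_{w_{pp},v_0}(v, \ldots, v_0, \ldots, v),$$

which completes our derivation. □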



The ability to compute these expectations exactly makes it possible to
use Bayes factors for model comparison [11]. Given some data $X$ and two
probabilistic models, where each model $i$ has a prior measure $P^{(i)}$ and
parameters $\theta_i$, a Bayes factor quantifies the relative odds between them. It is
formally defined as

(11) $$\frac{P^{(1)}(X)}{P^{(2)}(X)} = \frac{\int P^{(1)}(X \mid \theta_1) \, dP^{(1)}(\theta_1)}{\int P^{(2)}(X \mid \theta_2) \, dP^{(2)}(\theta_2)},$$

the ratio between the marginal probabilities of the data under each model. 
Each marginal probability is sometimes referred to as the evidence for the 
corresponding model. Diaconis and Rolles apply Bayes factors to compare 
a number of models on different data sets. They consider reversible Markov 
chains, general Markov chains, and i.i.d. models [10], assigning conjugate 
priors which facilitate computing the marginal probabilities in equation (11). 
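
As a small illustration of equation (11), the sketch below compares an i.i.d. model with a general first-order Markov chain, two of the model classes mentioned above. It is not the reversible prior of this paper: it assumes symmetric Dirichlet priors (on the symbol probabilities and on each row of the transition matrix), for which the marginal likelihood has the standard Dirichlet-multinomial closed form. The function names and the toy sequence are hypothetical, and both models condition on the first symbol so that they score the same data.

```python
import math
from collections import Counter


def log_dirichlet_multinomial(counts, alpha):
    """Log marginal likelihood of an ordered sequence with these symbol counts
    under a symmetric Dirichlet(alpha) prior (assumed prior, not the paper's)."""
    k = len(counts)
    n = sum(counts)
    out = math.lgamma(k * alpha) - math.lgamma(k * alpha + n)
    for c in counts:
        out += math.lgamma(alpha + c) - math.lgamma(alpha)
    return out


def log_evidence_iid(seq, states, alpha=1.0):
    # Condition on the first symbol so both models score seq[1:].
    tally = Counter(seq[1:])
    return log_dirichlet_multinomial([tally[s] for s in states], alpha)


def log_evidence_markov(seq, states, alpha=1.0):
    # Independent Dirichlet prior on each row of the transition matrix.
    tally = Counter(zip(seq[:-1], seq[1:]))
    return sum(
        log_dirichlet_multinomial([tally[(a, b)] for b in states], alpha)
        for a in states
    )


# Toy data: a strictly alternating sequence, which has obvious memory.
toy_sequence = [0, 1] * 50
toy_states = [0, 1]
log_bayes_factor = (
    log_evidence_markov(toy_sequence, toy_states)
    - log_evidence_iid(toy_sequence, toy_states)
)
```

A positive `log_bayes_factor` favors the Markov model; for this alternating toy sequence the Markov evidence is far larger, as one would expect.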

The conjugate priors introduced here facilitate similar comparisons, where 
the family of models under consideration is expanded to include reversible 
Markov chains that differ in their length of memory. For some data X, 
one can define two variable-order reversible Markov models, with different 
histories, $\mathcal{H}^{(1)}$ and $\mathcal{H}^{(2)}$. In each case, we assign a conjugate prior,
$\psi^{(1)}_{w,v_0}$ and $\psi^{(2)}_{w,v_0}$, respectively, to the transition probability matrix.
To make the prior uninformative in some sense we could set $w(u)$ to be uniform
for all $u$ and let $\beta$ be the shortest palindrome starting with $v_0$, for example.
The constant $c$ is set to 1. The Bayes factor is then

$$\frac{P^{(1)}(X)}{P^{(2)}(X)} = \frac{\int P^{(T)}_{v_0}(X) \, d\psi^{(1)}_{w,v_0}(T)}{\int P^{(T)}_{v_0}(X) \, d\psi^{(2)}_{w,v_0}(T)}.$$

We have seen that the expectations on the right-hand side can be computed
exactly when $X$ is a path starting at $v_0$ or any cyclic path. In the
following example, we apply this test to finite data sets simulated from 
a lumped Markov chain. 

Fig. 4. A lumped reversible Markov chain. [Graph not reproduced; the macrostates are labeled State 1, State 2 (slowly mixing) and State 3.]

Fig. 5. Boxplot of logarithmic Bayes factors computed from 50 independent datasets.

Example 5.5 (Order estimation for a lumped reversible Markov chain).
A random walk was simulated on the 9-state graph shown in Figure 4,
from which we omitted self-edges on every state, all weighted by 1. The
observation was lumped into the 3 macrostates separated by the dashed 
lines. This is meant to illustrate a natural experiment, where the difference 
between the states within each macrostate is obscured by the measurement. 
From the resulting sequence, we take the initial macrostate and every 7th
macrostate thereafter to form a path $X$ of length 1000 in $\mathcal{X} = \{1, 2, 3\}$.
We test 4 reversible Markov models that differ in the length of memory:

1. A first-order, reversible Markov chain. 

2. A second-order, reversible Markov chain. 

3. A variable-order model with maximum order 2, where states 1 and 3 are 
histories. Intuitively, only state 2 has "memory." 

4. A variable-order model with maximum order 2, where states 2 and 3 are 
histories. Intuitively, only state 1 has "memory." 

For each model $i$, we assign a prior $\psi^{(i)}_{w,v_0}$ to the transition matrix, where
$v_0$ is the initial state in $X$, $w(u) = 2$ for all $u$ and $\beta$ is the shortest
palindrome starting with $v_0$. We compared the 4 models using 50 independent
realizations of the lumped Markov chain and found that model 3 had
the highest evidence in 72% of the cases, while model 2 was selected in all 
the remaining cases. In Figure 5, we report a boxplot of the logarithm of 
the Bayes factors comparing models 1, 2, and 4 against model 3. 
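
The data-generating procedure of this example (a random walk on a weighted undirected graph, lumping into macrostates, and keeping every 7th observation) can be sketched as follows. The 9-state graph, its edge weights and the assignment of states to macrostates are hypothetical stand-ins, since the graph of Figure 4 is not reproduced here; only the stride of 7 and the path length of 1000 are taken from the text, and we read "length 1000" as 1000 recorded macrostates.

```python
import random


def build_symmetric_weights(edges):
    """Turn (u, v, weight) triples into a symmetric adjacency dictionary."""
    weights = {}
    for u, v, w in edges:
        weights.setdefault(u, {})[v] = w
        weights.setdefault(v, {})[u] = w
    return weights


def simulate_lumped_path(weights, macrostate, n_obs, stride, start, seed=0):
    """Random walk with steps proportional to edge weights; record the
    macrostate of the initial state and of every `stride`-th state after it."""
    rng = random.Random(seed)
    state = start
    path = [macrostate[state]]
    while len(path) < n_obs:
        for _ in range(stride):
            neighbors = list(weights[state])
            probs = [weights[state][v] for v in neighbors]
            state = rng.choices(neighbors, weights=probs)[0]
        path.append(macrostate[state])
    return path


# Hypothetical graph: three triangles joined in a chain (not the paper's graph).
HYPOTHETICAL_EDGES = [
    (0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0),
    (3, 4, 0.3), (4, 5, 0.3), (3, 5, 0.3),
    (6, 7, 1.0), (7, 8, 1.0), (6, 8, 1.0),
    (2, 3, 0.3), (5, 6, 0.3),
]
HYPOTHETICAL_MACROSTATE = {s: 1 + s // 3 for s in range(9)}

lumped_path = simulate_lumped_path(
    build_symmetric_weights(HYPOTHETICAL_EDGES),
    HYPOTHETICAL_MACROSTATE,
    n_obs=1000,
    stride=7,
    start=0,
)
```

Because the edge weights are symmetric, the underlying walk is reversible; the lumped and subsampled sequence generally is not first-order Markov, which is what the model comparison probes.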

This represents compelling evidence for model 3. The result is not entirely
surprising given that this model gives memory to state 2, which is slowly
mixing, as indicated in Figure 4. The fact that the most complex model
(model 2) is not necessarily selected showcases the automatic penalty for
model complexity in Bayes factors.

Fig. 6. The structure of Ace-Ala-Nme is described by two dihedral angles, $\phi$ and $\psi$. The
periodic map on the right shows a partition of conformational space into 5 states. The
colored markers indicate the free energy of bins centered at each point, which reveals the
metastable nature of this molecule's dynamics.

We conclude this section with two applications of Bayesian analysis of 
reversible Markov chains to molecular dynamics (MD). An MD simulation 
approximates the time-reversible dynamics of a molecule in solvent. The 
trajectories produced by a simulation are discretized in space and time. 

Example 5.6. The terminally blocked alanine dipeptide, shown in Fig- 
ure 6, is a common test system for Markov models of MD. The conforma- 
tional space of the molecule, which is represented in the figure in a two- 
dimensional projection, is partitioned into 5 states. The states are believed 
to be metastable due to the basins that characterize the free-energy func- 
tion, also plotted in the figure. This metastability allows one to approximate 
the dynamics of the molecule, projected onto the partition, as a reversible 
Markov chain. The approximation will be good when the discrete time in- 
terval at which a trajectory is sampled is larger than the timescale for equi- 
libration within every state, but smaller than the timescale of transitions. 

Few statistical validation methods are available for Markov models of MD. 
Bacallado, Chodera and Pande used a Bayesian hypothesis test to compare 
different partitions of conformational space [3]. Here, we apply Bayes factors 
to test a first-order Markov model on a fixed partition, by comparing it to
second-order and variable-order models on the same partition. The data $X$
are the transition counts in a single MD trajectory of 1767 steps sampled at
an interval of 6 picoseconds, as recorded in Table 1. The prior parameters
$w$ and $\beta$ are the same as in the previous example. The results of the model
comparison are summarized in the following table. 



Model (i)             $\log P^{(i)}(X)$

First order           -1846
Variable order        -1824
Variable order 1      -1825
Variable order 2      -1844
Variable order 3      -1846
Variable order 4      -1847
Second order          -1800



The state describing each variable order model is the only state in the 
model that has a memory of length 2 (the only state that is not a history). 
There seems to be substantial evidence in favor of a second-order model. 
Adding memory to states seen in a large number of transitions makes a bigger
difference, as expected. This result is in accordance with certain exploratory 
observations which indicate that at the timescale of 6 picoseconds, the effect 
of water around the molecule, neglected in our state definitions, persists.

Example 5.7. The alanine pentapeptide is a longer polymer that exhibits
a higher degree of structural and dynamical complexity. Buchete and
Hummer partition the conformational space of the molecule into 32 states
by chemical conventions [5]. An MD trajectory^ in conformational space was
projected onto this partition, and an exploratory analysis suggested that the
effects of memory decay after 500 picoseconds. Accordingly, we take a
conformation from the trajectory every 500 picoseconds to form a sequence $X$
of 1885 steps in $\mathcal{X} = \{0, \dots, 31\}$.

As in previous examples, we tested models with varying lengths of memory.
Each model was assigned a conjugate prior, this time setting $w(u) = 1/32$
for all $u$. Of all the variable-order models where a single state has a
memory of length 2 and all others are histories, we found that only 4 models
were strongly selected over a first-order model. In the following table, we
show the logarithm of the evidence for each of these models, a first-order
model and a variable-order model that gives a memory of length 2 to all 4
states.

Model (i)                         $\log P^{(i)}(X)$

First order                       -4090.0
Variable order 14                 -4015.5
Variable order 15                 -3814.5
Variable order 30                 -3860.3
Variable order 31                 -3301.6
Variable order 14, 15, 30, 31     -2964.3
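
The log evidences in this table can be turned into log Bayes factors against the first-order model by simple subtraction, as in the sketch below. The posterior model probabilities additionally assume equal prior probabilities for the six models, which is our assumption for illustration and is not stated in the text; the function names are hypothetical.

```python
import math

# Log evidences copied from the table for Example 5.7.
LOG_EVIDENCE = {
    "First order": -4090.0,
    "Variable order 14": -4015.5,
    "Variable order 15": -3814.5,
    "Variable order 30": -3860.3,
    "Variable order 31": -3301.6,
    "Variable order 14, 15, 30, 31": -2964.3,
}


def log_bayes_factors(log_evidence, reference):
    """Log Bayes factor of every model against the reference model."""
    base = log_evidence[reference]
    return {model: value - base for model, value in log_evidence.items()}


def posterior_model_probabilities(log_evidence):
    """Posterior probabilities under equal prior model probabilities
    (an assumption), computed stably by subtracting the maximum."""
    top = max(log_evidence.values())
    unnormalized = {m: math.exp(v - top) for m, v in log_evidence.items()}
    total = sum(unnormalized.values())
    return {m: u / total for m, u in unnormalized.items()}


log_bf = log_bayes_factors(LOG_EVIDENCE, "First order")
posterior = posterior_model_probabilities(LOG_EVIDENCE)
best_model = max(posterior, key=posterior.get)
```

Working on the log scale matters here: the evidences themselves are far too small to represent as floating-point numbers.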



This represents compelling evidence for a model that gives memory to 
states 14, 15, 30 and 31. It is interesting to contrast inferences based on 
this model to those based on a first-order Markov model. To do this, we 
computed 1000 approximate posterior samples of the transition matrix in 
each case. This was done by simulating a reinforced random walk, which is 
a mixture of variable-order Markov chains with the posterior distribution of 
T as a mixing measure (see Proposition 5.3). The reinforced random walk 
was simulated 10'' steps to obtain each sample. 
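
The sampling idea can be illustrated in the simpler first-order setting with the edge-reinforced random walk (ERRW) on an undirected graph. The sketch below is not the variable-order reinforcement scheme of Definition 4.3: it assumes a graph without self-loops, a fixed increment `c` per edge crossing, and hypothetical function and variable names. Starting from posterior weights, the normalized edge weights after a long run serve as one approximate posterior sample of the transition matrix.

```python
import random


def errw_transition_sample(init_weights, start, n_steps, c=1.0, seed=0):
    """Run an edge-reinforced random walk and return the normalized edge
    weights (an approximate sample of a reversible transition matrix)
    together with the final weights. Assumes no self-loops."""
    rng = random.Random(seed)
    w = {u: dict(nbrs) for u, nbrs in init_weights.items()}
    state = start
    for _ in range(n_steps):
        neighbors = list(w[state])
        nxt = rng.choices(neighbors, weights=[w[state][v] for v in neighbors])[0]
        # Reinforce the undirected edge that was just crossed.
        w[state][nxt] += c
        w[nxt][state] += c
        state = nxt
    transition = {
        u: {v: w[u][v] / sum(w[u].values()) for v in w[u]} for u in w
    }
    return transition, w


# Hypothetical starting weights on a triangle graph.
triangle = {0: {1: 1.0, 2: 1.0}, 1: {0: 1.0, 2: 1.0}, 2: {0: 1.0, 1: 1.0}}
sample_T, final_w = errw_transition_sample(triangle, start=0, n_steps=10000)
```

Because the edge weights stay symmetric, each sampled matrix satisfies detailed balance with stationary probabilities proportional to the total weight at each state. Repeating the run with different seeds gives independent approximate samples.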

In Figure 7, we histogram stationary probabilities of the transition
matrices sampled from the posterior. In particular, we show plots for the
stationary probabilities of states 14, 15, 30 and 31. In the variable-order model,
we define $\pi(x) = \sum_{y \in \mathcal{X}} \pi(yx)$. The inferences of each model in this case are
very similar.

^Simulated with the Amber-GSS forcefield at 300K in explicit solvent.

Fig. 7. Histograms of 1000 posterior samples of the stationary probabilities of states 14,
15, 30, 31. The red solid lines correspond to the first-order Markov model, and the green
dashed lines to the variable-order Markov model that gives a memory of length 2 to states
14, 15, 30 and 31.

The largest eigenvalues of the transition matrix are also of interest
because they are related to different modes of relaxation. Each eigenvalue $\lambda$
is associated with a timescale $-\tau_{\mathrm{lag}}/\log\lambda$, which is useful in exploratory
analysis. Here, $\tau_{\mathrm{lag}}$ is the length in time of one step of the Markov chain, or
500 picoseconds. In Figure 8, we histogram posterior samples of the three
largest nonunit eigenvalues and their associated timescales. In this case, the
inferences of each model are quite different, with the variable-order model
predicting larger eigenvalues and timescales.
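
The conversion from an eigenvalue to its timescale, $-\tau_{\mathrm{lag}}/\log\lambda$, is sketched below. The two-state transition matrix is a hypothetical example, chosen because its second eigenvalue has the closed form $1 - a - b$ for off-diagonal entries $a$ and $b$; only the lag of 500 picoseconds is taken from the text.

```python
import math


def implied_timescale(eigenvalue, lag):
    """Relaxation timescale -lag / log(eigenvalue) for an eigenvalue in (0, 1)."""
    if not 0.0 < eigenvalue < 1.0:
        raise ValueError("eigenvalue must lie strictly between 0 and 1")
    return -lag / math.log(eigenvalue)


def second_eigenvalue_two_state(a, b):
    """Nonunit eigenvalue of the matrix [[1 - a, a], [b, 1 - b]]."""
    return 1.0 - a - b


LAG_PS = 500.0  # one step of the Markov chain, in picoseconds
lam = second_eigenvalue_two_state(0.1, 0.2)  # hypothetical chain
timescale_ps = implied_timescale(lam, LAG_PS)
```

Eigenvalues closer to 1 map to much longer timescales, so modest differences between posterior eigenvalue histograms can translate into large differences in the inferred timescales.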

6. Conclusions. We define a reinforcement scheme for the higher-order, 
reversible Markov chain that extends the ERRW on an undirected graph. 
Several properties of the ERRW, like recurrence and partial exchangeability, 
were shown to generalize to this process. Other properties may also generalize
but were not pursued here. In particular, we can mention the uniqueness
results of Johnson [18] and Rolles [16], and the fact that mixtures of measures
in $\mathcal{D}$ are weak-star dense in the space of all priors [10].

Fig. 8. Histograms of 1000 posterior samples of the second, third and fourth largest
eigenvalues of the transition matrix, as well as the timescales associated with these eigenvalues.
The red solid lines correspond to the first-order Markov model, and the green dashed lines
to the variable-order Markov model that gives memory to states 14, 15, 30 and 31. In both
cases, we compute the eigenvalues of the transition matrix for the process $X_n$, $n \in \mathbb{N}$, in
$\mathcal{X}$. All sample means $\mu$ and standard deviations $\sigma$ are shown.

The reinforced random walk leads to a conjugate prior that facilitates
estimation and hypothesis testing of reversible processes in which the
effects of memory decay after some time. Certain statistical problems remain
a challenge, such as inferring the transition matrix with a fixed stationary
distribution. In applications, it will become important to evaluate the
objectivity of the prior and to determine the optimal value of its parameters
in this sense.

From a practical point of view, we only discussed Bayesian updating for
data sets composed of a single Markov chain starting with probability 1
from the initial state $v_0$ used in the prior. Numerical algorithms are needed
to perform inference with data sets composed of multiple chains. A starting 
point could be the method developed by Bacallado, Chodera and Pande 
to apply the prior of Diaconis and Rolles to first-order, reversible Markov 
chains [3]. 

APPENDIX 

In the following, we use the notation defined in the first paragraph of 
Section 2. 



Proposition A.1 (Kolmogorov's criterion). Let $X_n$, $n \in \mathbb{N}$, be an
irreducible order-$r$ Markov chain with transition probabilities $p$. Then $X_n$ is
reversible if and only if for any cyclic admissible path $v_0, v_1, \dots, v_n, v_0$,

(12) $$p(v_1|v_0)\,p(v_2|v_1) \cdots p(v_0|v_n) = p(v_0^*|v_1^*)\,p(v_1^*|v_2^*) \cdots p(v_n^*|v_0^*).$$
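
For intuition, the criterion can be checked numerically in the first-order case $r = 1$. Assuming $v^*$ denotes the reversed sequence, every single state is its own reversal, so equation (12) reduces to the classical statement that the product of transition probabilities around any cycle equals the product around the reversed cycle. The sketch below (hypothetical function names, small state spaces only) tests this on all simple cycles, which suffices because any closed walk decomposes into simple cycles.

```python
from itertools import permutations


def cycle_probability(P, cycle):
    """Product of transition probabilities around a closed cycle of states."""
    prob = 1.0
    for a, b in zip(cycle, cycle[1:] + cycle[:1]):
        prob *= P[a][b]
    return prob


def satisfies_kolmogorov(P, tol=1e-12):
    """Check the first-order Kolmogorov criterion on every simple cycle."""
    n = len(P)
    for length in range(3, n + 1):
        for cycle in permutations(range(n), length):
            forward = cycle_probability(P, list(cycle))
            backward = cycle_probability(P, list(cycle)[::-1])
            if abs(forward - backward) > tol:
                return False
    return True


# Reversible example: row-normalized symmetric edge weights.
W = [[0.0, 1.0, 2.0], [1.0, 0.0, 3.0], [2.0, 3.0, 0.0]]
P_reversible = [[w / sum(row) for w in row] for row in W]

# Non-reversible example: a walk biased in one direction around a triangle.
P_biased = [[0.0, 0.9, 0.1], [0.1, 0.0, 0.9], [0.9, 0.1, 0.0]]

reversible_ok = satisfies_kolmogorov(P_reversible)
biased_ok = satisfies_kolmogorov(P_biased)
```

The enumeration grows factorially with the number of states, so this is only a didactic check, not a practical test for large chains.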

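Proof. The "only if" statement is straightforward. By the definition of
the stationary distribution and reversibility

$$p(v_1|v_0)\,p(v_2|v_1) \cdots p(v_0|v_n) = \frac{P_\pi(v_0 v_1)}{\pi(v_0)} \frac{P_\pi(v_1 v_2)}{\pi(v_1)} \cdots \frac{P_\pi(v_n v_0)}{\pi(v_n)}$$

$$= \frac{P_\pi(v_1^* v_0^*)}{\pi(v_1^*)} \frac{P_\pi(v_2^* v_1^*)}{\pi(v_2^*)} \cdots \frac{P_\pi(v_0^* v_n^*)}{\pi(v_0^*)}$$

$$= p(v_0^*|v_1^*)\,p(v_1^*|v_2^*) \cdots p(v_n^*|v_0^*).$$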

To prove the "if" statement, choose an arbitrary state $u$; then, for any $v$,
since the chain is irreducible, there is an admissible path $u, v_1, v_2, \dots, v_n, v$
with positive probability. Define

(13) $$\pi'(v) = B \, \frac{p(v_1|u)\,p(v_2|v_1) \cdots p(v|v_n)}{p(v_n^*|v^*)\,p(v_{n-1}^*|v_n^*) \cdots p(u^*|v_1^*)},$$

where $B$ is a positive constant. Note that this expression does not depend
on the sequence $v_1, \dots, v_n$ chosen. Take a different sequence $z_1, \dots, z_m$. Let
$t \in \mathcal{X}^r$ be a palindrome, then because the chain is irreducible, we can find
a sequence $v, t_1, t_2, \dots, t$ with positive probability, and it is easy to see from
equation (12) that the palindrome $v, t_1, t_2, \dots, t, \dots, t_2^*, t_1^*, v^*$ has positive
probability. We can construct another palindrome $u^*, s_1, s_2, \dots, s_2^*, s_1^*, u$ in
the same way. Multiplying equation (13) by factors of 1,

$$\pi'(v) = B \, \frac{p(v_1, v_2, \dots, v|u)}{p(v_n^*, \dots, u^*|v^*)}$$

$$= B \, \frac{p(v_1, v_2, \dots, v|u)}{p(v_n^*, \dots, u^*|v^*)} \, \frac{p(t_1, t_2, \dots, v^*|v)}{p(t_1, t_2, \dots, v^*|v)} \, \frac{p(z_m^*, z_{m-1}^*, \dots, u^*|v^*)}{p(z_1, z_2, \dots, v|u)} \, \frac{p(s_1, s_2, \dots, u|u^*)}{p(s_1, s_2, \dots, u|u^*)} \, \frac{p(z_1, z_2, \dots, v|u)}{p(z_m^*, z_{m-1}^*, \dots, u^*|v^*)}$$

$$= B \, \frac{p(z_1, z_2, \dots, v|u)}{p(z_m^*, z_{m-1}^*, \dots, u^*|v^*)}.$$

The product of the first four terms equals 1 because the numerator and
denominator are the probabilities of the same cycle forward and backward,
which are equal by equation (12). Now, we check that $\pi'(v)$ satisfies the
reversibility conditions specified in the Introduction. First, we show that
$\pi'(v) = \pi'(v^*)$. Take a path $u, z_1, \dots, z_l, v^*$ with positive probability, and the
previously found palindrome $u^*, s_1, s_2, \dots, s_2^*, s_1^*, u$, then applying the same method,

$$\pi'(v) = B \, \frac{p(v_1, v_2, \dots, v|u)}{p(v_n^*, \dots, u^*|v^*)}$$

$$= B \, \frac{p(v_1, v_2, \dots, v|u)}{p(v_n^*, \dots, u^*|v^*)} \, \frac{p(s_1, s_2, \dots, u|u^*)}{p(s_1, s_2, \dots, u|u^*)} \, \frac{p(z_l^*, z_{l-1}^*, \dots, u^*|v)}{p(z_1, z_2, \dots, v^*|u)} \, \frac{p(z_1, z_2, \dots, v^*|u)}{p(z_l^*, z_{l-1}^*, \dots, u^*|v)}$$

$$= B \, \frac{p(z_1, z_2, \dots, v^*|u)}{p(z_l^*, z_{l-1}^*, \dots, u^*|v)} = \pi'(v^*).$$

From this, and equation (13) we deduce that for any admissible $v, z$,
$\pi'(v)\,p(z|v) = \pi'(z^*)\,p(v^*|z^*)$. Since the state space is finite, we can choose $B$ such that
$\pi'$ sums to 1. We have shown that the weights $k_{v,z} = \pi'(v)\,p(z|v)$ satisfy the
conditions of a reversible random walk with memory, so by Proposition 2.3
the process with transition probabilities $p$ represents a reversible, order-$r$
Markov chain. □



Proof of Proposition 4.5. The probability $H_{w,v_0}(v_0, \dots, v_m)$ is a
product of transition probabilities, to which the $n$th transition contributes
a factor of

(14) $$\frac{w'_{n-1}(f(v_{n-1})v_n)}{w''_{n-1}(f(v_{n-1}))}.$$

We know that $f(v_n)$ cannot be longer than $f(v_{n-1})v_n$ by Proposition 4.2;
let $L(v_{n-1}, v_n)$ be the set of histories of $v_n$ that are shorter than $f(v_{n-1})$. If
this set is nonempty, let us multiply equation (14) by factors of 1, to obtain
the following factor for the $n$th transition:

(15) $$\frac{w'_{n-1}(f(v_{n-1})v_n)}{w''_{n-1}(f(v_{n-1}))} \prod_{z \in L(v_{n-1}, v_n)} \frac{w'_{n-1}(z^+)}{w''_{n-1}(z^+)},$$

where $z^+$ is the ending of $v_n$ that is longer than $z$ by 1. The added factor
equals 1 because, if $w'_{n-1}(z^+) \neq w''_{n-1}(z^+)$, then $f(v_{n-1})$ must end on $z$,
which by definition is a history shorter than $f(v_{n-1})$, a contradiction.

Consider all the possible factors in the numerator of $H_{w,v_0}(v_0, \dots, v_m)$.
Take any $h \in \mathcal{H}$ that is minimal, meaning that it does not end in another
history. For any $a \in \mathcal{X}$, we will see a factor $w'(ha)$ after every transition
through $ha$. The conjugate factor $w'(ah^*)$ will appear every time we go
through $ah^*$, because:

• If $A(ah^*) \in \mathcal{H}$, it is minimal by the closure properties of $\mathcal{H}$, so $w'(ah^*)$
will be the numerator of the first factor in equation (15).

• Otherwise, the minimal history in the transition ending in $ah^*$ will be
longer than $h^*$, and there will be an added factor in equation (15) with
$w'(ah^*)$ in the numerator. Conversely, note that the factor $w'(ah^*)$ is only
added to the numerator of equation (15) when we go through $ah^*$ for some
minimal $h$, because we required that $h^* \in L(v_{n-1}, v_n)$, so $h \in \mathcal{H}$ and does
not end in another history.

As in the proof of Proposition 3.5, we argue that every new factor $w'(ha)$
or $w'(ah^*)$ is increased by $c$ with respect to the previous one (or by $2c$ if
$ha$ is a palindrome). Therefore, the numerator of $H_{w,v_0}(v_0, \dots, v_m)$ is only a
function of the transition counts and the initial state.

Finally, consider all the factors in the denominator of $H_{w,v_0}(v_0, \dots, v_m)$.
Take any minimal history $h$. We will see a factor $w''(h)$ for every transition
through $h$. The conjugate factor $w''(h^*)$ will appear every time we go through
$h^*$, because:

• If $h^*$ is also minimal, then $w''(h^*)$ will be in the denominator of the first
factor in equation (15).

• Otherwise, we know that $A(h^*)$ is not a history, so the transition ending
in $h^*$ must have a history at least as long as $h^*$, which is longer than
the history $Q(h^*)$. So, $w''(h^*)$ will appear in the denominator of a factor
added in equation (15). Conversely, we only add factors of $w''(h^*)$ to the
denominator of equation (15) when we go through $h^*$ for a minimal $h$,
because we required $Q(h^*) \in L(v_{n-1}, v_n)$, which implies $h$ minimal.

As before, every new factor $w''(h)$ or $w''(h^*)$ will be increased by $c$ with
respect to the previous one (or by $2c$ if $h$ is a palindrome). Therefore, the
denominator is a function of the transition counts and the initial state, and
the process is partially exchangeable. □
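
Partial exchangeability can be checked by hand in the simplest case. The sketch below computes exact path probabilities for a first-order edge-reinforced random walk on a complete graph, which is a simplification and not the variable-order scheme analyzed in this proof. It assumes unit initial weights on every edge and an increment `c` per crossing; the function name and the two example paths are hypothetical. The two paths start at the same state and have the same transition counts, visited in a different order.

```python
from fractions import Fraction


def errw_path_probability(path, states, c=1):
    """Exact probability of a path under an edge-reinforced random walk on the
    complete graph over `states`, with unit initial edge weights (assumed)."""
    w = {
        frozenset((a, b)): Fraction(1)
        for a in states
        for b in states
        if a != b
    }
    prob = Fraction(1)
    for u, v in zip(path, path[1:]):
        total = sum(w[frozenset((u, x))] for x in states if x != u)
        prob *= w[frozenset((u, v))] / total
        w[frozenset((u, v))] += c  # reinforce the edge just crossed
    return prob


STATES = [0, 1, 2]
# Same initial state and same transition counts, different order.
path_a = [0, 1, 2, 0, 1, 0]
path_b = [0, 1, 0, 1, 2, 0]
prob_a = errw_path_probability(path_a, STATES)
prob_b = errw_path_probability(path_b, STATES)
```

Both probabilities come out to 1/60: the individual factors differ between the two paths, but the multisets of numerators and denominators coincide, which is the mechanism the proof tracks for the variable-order walk.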
Proof of Proposition 4.6. Let $C_n(u, v)$ be the transition counts
from $u$ to $v$ in the first $n$ steps of a stochastic process on $\mathcal{X}^r$. Also,
define $C_n(u) = \sum_{v \in \mathcal{X}^r} C_n(u, v)$, which counts the visits to $u$. Remember $\mathcal{T}_{\mathcal{H}}$
is the set of irreducible transition matrices for variable-order, reversible
Markov chains where all $h \in \mathcal{H}$ are histories. Define the event $D$, that the
set $\{C_n(u, v)/C_n(u) : \forall u, v \text{ admissible}\}$ converges to a transition probability
matrix in $\mathcal{T}_{\mathcal{H}}$.

From the recurrence of the variable-order, reinforced random walk and
equation (7), it is evident that the set of irreducible Markov chains has
measure 1 under $\psi_{w,v_0}$. In this set, the variables $\{C_n(u, v)/C_n(u) : \forall u, v \text{ admissible}\}$
converge almost surely to the transition probabilities, so for any irreducible
$T \notin \mathcal{T}_{\mathcal{H}}$, $P^{(T)}_{v_0}(D) = 0$. Furthermore, by Lemma A.3, $D$ happens almost surely
in the variable-order, reinforced random walk. Putting this into equation (7),
we have

$$H_{w,v_0}(D) = 1 = \int P^{(T)}_{v_0}(D) \, d\psi_{w,v_0}(T) \le \int_{\mathcal{T}_{\mathcal{H}}} d\psi_{w,v_0}(T),$$

which implies the proposition. □

Lemma A.2 (Lévy). Consider a sequence of events $B_k \in \mathcal{F}_k$, $k \in \mathbb{N}$, in
some filtration $\{\mathcal{F}_k\}$. Let $b_n = \sum_{k=1}^{n} 1_{B_k}$ be the total number of events
occurring among the first $n$, and let $s_n = \sum_{k=1}^{n} P(B_k \mid \mathcal{F}_{k-1})$ be the sum of the
first $n$ conditional probabilities. Then, for almost every $\omega$:

• If $s_n(\omega)$ converges as $n \to \infty$, then $b_n(\omega)$ has a finite limit.

• If $s_n(\omega)$ diverges, then $b_n(\omega)/s_n(\omega) \to 1$.
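
The second case of the lemma can be illustrated by simulation. The sketch below uses independent events with probabilities $p_k = k^{-1/2}$, a hypothetical choice for which the conditional probabilities are just $p_k$ and their sum diverges; the function name and sample size are likewise our own. It is a numerical illustration only and plays no role in the proofs.

```python
import random


def levy_ratio(n, seed=0):
    """Simulate independent events B_k with P(B_k) = k ** -0.5 and return
    b_n / s_n, the count of occurrences over the summed probabilities."""
    rng = random.Random(seed)
    occurred = 0
    prob_sum = 0.0
    for k in range(1, n + 1):
        p = k ** -0.5
        prob_sum += p
        if rng.random() < p:
            occurred += 1
    return occurred / prob_sum


ratio = levy_ratio(200000)
```

With $n = 200000$ the summed probabilities are close to $2\sqrt{n} \approx 890$, so the ratio should already be near 1, up to fluctuations of a few percent.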

Lemma A.3. $H_{w,v_0}(D) = 1$.

Proof. For any $u$ in $\{\mathcal{X}^q : q \le r + 1\}$, the variables $n^{-1} w'_n(u)$ and
$n^{-1} w''_n(u)$ are functions of $\{n^{-1} C_n(u, v) : \forall u, v \text{ admissible}\}$, therefore they
converge almost surely, because the reinforced random walk is a mixture of
irreducible Markov chains for which the latter converge. The reinforcement
scheme defined in Definition 4.3 imposes some constraints on the limits of
$n^{-1} w'_n(u)$ and $n^{-1} w''_n(u)$. Note that $w''_n(u)$, $w''_n(u^*)$, $w'_n(u)$ and $w'_n(u^*)$ never
differ by more than $c$; we also know that the reinforced random walk is
positive recurrent (it is a mixture of irreducible, finitely-valued Markov chains),
so almost surely

(16) $$\lim_{n \to \infty} n^{-1} w'_n(u) = \lim_{n \to \infty} n^{-1} w'_n(u^*) = \lim_{n \to \infty} n^{-1} w''_n(u) = \lim_{n \to \infty} n^{-1} w''_n(u^*) > 0.$$

Denote this limit $w_\infty(u) = w_\infty(u^*)$. It is also easy to see that if $u \in \mathcal{X}^q$, then
for all $s > q$,

(17) $$\sum_{\{v \in \mathcal{X}^s : v \text{ ends in } u\}} w_\infty(v) = w_\infty(u).$$

Now, let $\tau_n$ be the $n$th visit to $u \in \mathcal{X}^r$ and let $B_n$ be the event that we
make a transition to $v$ at $\tau_n$. Define

$$p_n(f(u), v) = H_{w,v_0}(B_n \mid \sigma(Y_1, \dots, Y_{\tau_n})) = \frac{w'_{\tau_n}(f(u)v)}{w''_{\tau_n}(f(u))}.$$

We know $p_n(f(u), v)$ converges a.s. to $w_\infty(f(u)v)/w_\infty(f(u)) > 0$. Therefore,
$\sum_n p_n(f(u), v) = \infty$ a.s., and by Lévy's extension of the Borel-Cantelli
lemma (Lemma A.2),

$$\frac{\sum_{m=1}^{n} 1_{B_m}}{\sum_{m=1}^{n} p_m(f(u), v)} \to 1 \quad \text{a.s.}$$

Since $p_m(f(u), v)$ converges, its running average has the same limit, so

$$\lim_{n \to \infty} \frac{1}{n} \sum_{m=1}^{n} 1_{B_m} = \lim_{k \to \infty} \frac{C_k(u, v)}{C_k(u)} = \frac{w_\infty(f(u)v)}{w_\infty(f(u))}.$$

This means that $\{C_n(u, v)/C_n(u) : \forall u, v \text{ admissible}\}$ converges $H_{w,v_0}$-a.s. to
a set of transition probabilities, $w_\infty(f(u)v)/w_\infty(f(u))$, for a variable-order
Markov chain with histories $\mathcal{H}$. To show that this Markov chain is reversible,
note that $w_\infty$ is the stationary distribution, because

$$\sum_{\substack{u \in \mathcal{X}^r : \\ u, v \text{ admissible}}} w_\infty(u) \frac{w_\infty(f(u)v)}{w_\infty(f(u))} = \sum_{\substack{h \in \mathcal{H} \text{ minimal} : \\ h, v \text{ admissible}}} \; \sum_{\{u : f(u) = h\}} w_\infty(u) \frac{w_\infty(hv)}{w_\infty(h)} = \sum_{\substack{h \in \mathcal{H} \text{ minimal} : \\ h, v \text{ admissible}}} w_\infty(hv) = w_\infty(v),$$

where we used equation (17) in the last two identities. By equation (16),
$w_\infty$ satisfies the conditions for reversibility. Therefore, $H_{w,v_0}(D) = 1$. □
SUPPLEMENTARY MATERIAL

Law of a variable-order, reinforced random walk
(DOI: 10.1214/10-AOS857SUPP; .pdf). We provide a closed form expression
for this law as a function of transition counts and suggest how it could be
useful.